ABSTRACT

Large datasets ("Big Data") are becoming ubiquitous be- 
cause the potential value in deriving insights from data, 
across a wide range of business and scientific applications, is 
increasingly recognized. The data growth has been accom- 
panied by rapid adoption of large, elastic, multi-tenanted 
computing clusters ("compute clouds"), leading to a virtu- 
ous cycle: the scalability of cloud computing makes it possi- 
ble to analyze ever larger datasets, and the proliferation of 
Big Data leads to further adoption of cloud computing. In 
particular, machine learning — one of the foundational dis- 
ciplines for data analysis, summarization and inference — on 
Big Data has become routine at most organizations that op- 
erate large clouds, usually based on systems such as Hadoop 
that support the MapReduce programming paradigm. It is 
now widely recognized that while MapReduce is highly scal- 
able, it suffers from a critical weakness for machine learn- 
ing: it does not support iteration. Consequently, one has 
to program around this limitation, leading to fragile, inef- 
ficient code. Further, reliance on the programmer is inher- 
ently flawed in a multi-tenanted cloud environment, since 
the programmer does not have visibility into the state of 
the system when his or her program executes. Prior work 
has sought to address this problem by either developing spe- 
cialized systems aimed at stylized applications, or by aug- 
menting MapReduce with ad hoc support for saving state 
across iterations (driven by an external loop). In this paper, 
we advocate support for looping as a first-class construct, 
and propose an extension of the MapReduce programming 
paradigm called Iterative MapReduce. We then develop an 
optimizer for a class of Iterative MapReduce programs that 
cover most machine learning techniques, provide theoretical 
justifications for the key optimization steps, and empirically 
demonstrate that system-optimized programs for significant 
machine learning tasks are competitive with state-of-the-art 
specialized solutions. 

General Terms 

Systems, Machine Learning 



1. INTRODUCTION 

The volume of data is skyrocketing as organizations recog- 
nize the potential value of data-driven approaches to opti- 
mizing every aspect of their operation, and scientific disci- 
plines ranging from astronomy to zoology become increas- 
ingly data-centric in everything from hypothesis formulation 
to theory validation. Large scale analytics are a key to de- 
riving insight from this deluge of data, and machine learning 
(ML) is now established as a foundational discipline that is 
ever more valuable as datasets grow larger. For example,
by analyzing billions of transactions, credit-card companies
are able to quickly identify stolen credit cards, insurance
companies can flag claims for possible fraud, and supermarkets
derive promotions based on consumer purchases.

The sheer size of today's data sets far exceeds the capacity of 
a single machine. Big Data analytics platforms based on the 
MapReduce paradigm, such as Hadoop, have enabled statis- 
tical queries over large data, and many ML algorithms can 
be cast in terms of these queries [9, 7]. However, MapReduce
fails to recognize the iterative nature of most ML algorithms, 
and due to this unfortunate omission, while ML computa- 
tions can be expressed using MapReduce, execution over- 
heads are significantly higher than in Message Passing Interface
or algorithm-specific implementations (e.g., [14, 13]).

Failing to recognize iteration as a first class programming 
abstraction is a step backwards, as it forces the programmer 
to make systems-level decisions. For example, in Spark [15]
the programmer has to decide what data to cache in dis- 
tributed main memory. This approach is ill-suited for large, 
multi-tenant clusters such as public clouds where important 
performance-related parameters change constantly and in a 
way that is hard for a programmer to track. In address- 
ing this challenge, we draw our inspiration from database 
systems, where the level of abstraction introduced by the 
relational model freed users from low-level systems consider- 
ations, and opened the door to DBMS-driven optimization. 

In this paper, we extend the MapReduce paradigm
with support for iteration, and present a principled frame- 
work for optimizing the runtime of systems such as Hadoop 
to efficiently support Iterative MapReduce programs. To 
this end, we make the following contributions: 



1. Iterative MapReduce: We formalize the Iterative
MapReduce programming model, and describe how
many recent proposals to support ML over Big Data
can be expressed readily in this model. (Section 2)

2. Runtime: We present a new runtime for Iterative 
MapReduce. (Section 4)

3. Optimizer: We develop an optimizer that picks a 
good runtime plan when given data, program and clus- 
ter parameters. In particular, we consider two key 
choices: the partitioning strategy for the training data, 
and the structure of the aggregation that is applied 
to the intermediate statistics produced by the com- 
putation. We argue that these are the only tunable 
knobs since the computation itself (the logic of the 
Map and Reduce steps) is opaque, and present a theoretical
foundation for our optimizer. (Section 5)

4. Empirical study: We empirically validate both our 
optimizer and our runtime, the latter by demonstrat- 
ing that it can outperform a state of the art system, 
VW [1]. (Section 6)

2. ITERATIVE MAPREDUCE 

2.1 Background: MapReduce 

MapReduce is a functional programming model that splits 
the traditional group-by-aggregation computation into two
steps: map and reduce [8]. The (user-specified, opaque) map
step is responsible for transforming the input into key-value 
record pairs. The key identifies the group to which the value 
belongs; all values associated with the same key are grouped 
together. The (also user-specified and opaque) reduce step 
is then used to process each group and produce the final 
result. The computation associated with the reduce step 
is commonly an aggregate function (e.g., sum, max, mean, 
etc.), which produces a scalar value for each group. 
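The group-by-aggregation view described above can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions: the helper `map_reduce` and the toy `sales` records are hypothetical names introduced here, and nothing in it reflects the Hadoop API.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, map_fn, reduce_fn):
    """Map each record to (key, value) pairs, group by key, then fold each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # The reduce step turns every group into a single scalar.
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}


# Hypothetical input: (store, sale amount) records.
sales = [("north", 10.0), ("south", 4.0), ("north", 2.5), ("south", 6.0)]


def by_store(record):
    """Map step: the store name is the key, the amount is the value."""
    return [(record[0], record[1])]


totals = map_reduce(sales, by_store, lambda a, b: a + b)  # sum per group
maxima = map_reduce(sales, by_store, max)                 # max per group
```

Both aggregates reuse the same map step; only the reduce function changes, which is the separation the model relies on.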

The MapReduce programming model has been used to im- 
plement many higher-level programming abstractions. Pig 
Latin [11] and Hive [4] both provide a SQL layer with some
notable extensions (e.g., correlated sub-queries) on top of the
Hadoop MapReduce runtime. Such higher-level abstractions
allow programmers to express their computations in a form
that is closer to an intended target domain (e.g., data analytics).
In our work, we have built a higher-level abstraction
for machine learning using MapReduce called ScalOps [12],
which is a Scala domain-specific language (DSL) that uses
Pig Latin-like syntax.

2.2 Iterative MapReduce 

Many machine learning algorithms can be expressed as it- 
erative procedures refining the model, given training data. 
More to the point, the body of these iterations can be expressed
solely in terms of statistical queries [9] over the training
data such as min, max, mean and sums; these queries
can be naturally computed in MapReduce. This insight was
used by Chu et al. [7] to express effective parallel versions of
several machine learning algorithms (e.g., backpropagation 
in neural networks, EM, logistic regression, linear SVMs, 
PCA) relying only on sums over functions applied to the 
data. 

Inspired by these earlier results and building on our own 
work towards a more general programming interface for cloud-based
Big Data analysis [13, 12], we introduce an extension
of the MapReduce programming paradigm, called Iterative
MapReduce, that supports iteration as a fundamental con- 
struct. Iterative MapReduce is defined in terms of a collec- 
tion of operators that can be composed to create dataflow 
programs. Each operator accepts an input and produces
an output. Chaining operators is therefore the main composition
method in Iterative MapReduce. The computation
itself is expressed using three key operators:



MapReduce: This operator has two inputs: the data set and
side information that it makes available to the user-defined
map and reduce functions it hosts. The map function
is applied to all records in the immutable input
data and the reduce function is applied to aggregate
the outputs of that process. We define reduce in the
sense typically found in functional programming languages:
It is an associative and commutative function that
accepts two inputs and reduces them to a single output.
Section 4 looks at how we can parallelize this step
over a cluster of machines.

Sequential: This operator accepts a single input, and pro- 
duces a single output using the user defined function it 
hosts. Separating such code from the MapReduce opera- 
tor allows us to ensure an associative and commutative 
reduce function. 

Loop: This is a fundamental extension to the basic MapRe- 
duce paradigm. As in most programming languages, 
our Loop operator accepts three inputs: a body, a con- 
dition and an initializer. The body contains a chain of 
MapReduce and Sequential operators. The output of one 
forms the input of the next operator in this chain. We 
require that the output of the last operator is valid in- 
put to both the loop condition (see below) and the first 
operator of the chain. The condition accepts the loop 
body's output as input and returns a boolean value in- 
dicating whether the loop should terminate, while the 
initializer is used to provide an initial input for the 
loop body. 



Many programs can be expressed using these three opera- 
tors. Trivially, they facilitate the construction of loops over 
sequential code. More importantly, they allow us to write 
iterative ML algorithms without recourse to external mech- 
anisms (in particular, without using a top-level driver that 
invokes MapReduce within a loop, but is not visible to the 
MapReduce system). To express most iterative ML algo- 
rithms, the loop body would consist of a single MapReduce 
operator that computes the relevant statistics, using the cur- 
rent model state as an input. This would be followed by a 
Sequential operator that updates the model. 
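As a rough illustration of this special case, the following single-process Python sketch wires the three operators together for batch gradient descent on a toy one-dimensional regression. Every name here (`map_reduce`, `loop`, `DATA`, `STEP`, the state dictionary) is a hypothetical stand-in chosen for the example and is not the ScalOps or Hyracks interface.

```python
from functools import reduce


def map_reduce(data, side_info, map_fn, reduce_fn):
    """MapReduce operator: map every record with the side information, fold pairwise."""
    return reduce(reduce_fn, (map_fn(record, side_info) for record in data))


def loop(initializer, condition, body):
    """Loop operator: chain the operators in `body` until `condition` returns True."""
    state = initializer()
    while True:
        for operator in body:
            state = operator(state)
        if condition(state):
            return state


# Assumed toy training data for the model y = w * x with true w = 3.
DATA = [(float(x), 3.0 * x) for x in range(1, 5)]
STEP = 0.1  # assumed step size


def gradient_and_loss(record, w):
    """Map function: per-record gradient and squared loss for the current model w."""
    x, y = record
    err = w * x - y
    return (err * x, 0.5 * err * err)


def add_pairs(a, b):
    """Reduce function: associative and commutative component-wise sum."""
    return (a[0] + b[0], a[1] + b[1])


def statistics_step(state):
    """MapReduce operator in the loop body: aggregate statistics for the current model."""
    grad, loss = map_reduce(DATA, state["w"], gradient_and_loss, add_pairs)
    return {**state, "grad": grad, "loss": loss}


def update_step(state):
    """Sequential operator: update the model from the aggregate statistics."""
    n = len(DATA)
    return {"w": state["w"] - STEP * state["grad"] / n,
            "loss": state["loss"] / n,
            "iter": state["iter"] + 1}


def init():
    return {"w": 0.0, "loss": float("inf"), "iter": 0}


def done(state):
    return state["loss"] < 1e-12 or state["iter"] >= 100


model = loop(init, done, [statistics_step, update_step])
```

The output of `update_step` is a valid input to both `done` and `statistics_step`, mirroring the requirement placed on the last operator of the loop body.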

While we discuss this special case extensively due to its im- 
portance in the machine learning domain, we note that the 
Iterative MapReduce programming model is in fact more 
general, and supports loops over multiple MapReduce oper- 
ators as well as loops over any sequence of MapReduce and 
Sequential operators. This, for example, facilitates
the native expression of optimization algorithms that probe 
multiple possible gradient step sizes. 





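Figure 1: Iterative MapReduce Dataflow for ML. [Diagram: the Loop operator (Continue()) passes the (Model, Performance) pair to the MapReduce operator (Map(), Reduce()), which reads the Training Data and sends Aggregate Statistics to the Sequential operator (Update()).]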



The situation for the important special case is depicted in
Figure 1 from a data-flow perspective. The arrows indicate
control flow and carry data from one step to the next.
The Loop operator drives each iteration, until some stop- 
ping condition is met. It is also responsible for producing 
the initial model. In a single iteration, the MapReduce 
operator accepts the current model and uses it to process 
the training data, and produce an aggregate statistic. The 
Sequential step uses the aggregate statistic to update the 
model, before returning control to the Loop for a (possible) 
subsequent iteration. 

While we are not the first to recognize the need for support- 
ing iteration in MapReduce, we are the first to explore the 
consequences of adding iteration as a fundamental construct 
in the MapReduce system, and in particular to demonstrate 
the opportunities for system-driven program optimization. 
Prior work is focused on assembling ML algorithms within 
a specialized runtime [1, 10, 15] targeted at specific applications,
or invoking a general-purpose MapReduce engine.
In contrast, we have developed the Iterative MapReduce pro- 
gramming model with ML-style programs in mind (the Loop 
operator is especially noteworthy), and (in Section 5) develop
an optimizer that can translate a broad class of pro-
grams in this model (covering most ML programs) to an 
efficient runtime execution plan for an arbitrary cluster en- 
vironment. The systems-driven optimization enabled by our 
approach is especially valuable in multi-tenanted and elastic 
cloud systems, whose rapidly changing resource availability 
makes it difficult if not impossible for programmers to man- 
ually configure their programs effectively. 

3. RELATED WORK 

In translating programs from our programming model to 
efficient runtime plans, we seek to exploit optimizations dis- 
covered in prior work, which we review in this section. 

Hadoop [3] is the dominant Open Source Software im- 
plementation that supports the MapReduce programming 
model [8]. A Hadoop job executes a single MapReduce iteration.
The input and output of the job are stored in a
distributed filesystem (HDFS). A job consists of a map and 
reduce step, which are parallelized over many tasks. Hadoop 
tries to schedule map tasks on machines that host the in- 
put data, so the number of map tasks is data dependent. 
The number of reduce tasks is a job parameter, set by the 
programmer. The intermediate data produced by the map 
tasks and consumed by the reduce tasks is managed by the 



Hadoop runtime, which uses a sort-based implementation to 
perform the group-by operation. The Hadoop API also ex- 
poses a "combiner" function that supports pre-aggregation of 
this intermediate data. Hadoop does not have support for a 
loop step. Instead, an external driver must implement such 
a step by repeatedly submitting jobs to the Hadoop runtime. 
Each job executes in isolation and any information produced 
by the previous job is fed to the new job through back chan- 
nels (i.e., the HDFS file system). Lastly, the training data 
must be re-read from its source (i.e., HDFS), forgoing the 
benefits of caching. 

HaLoop [6] exposes an application programming interface
that supports iterations in Hadoop MapReduce. The extension
adds a loop control module to the Hadoop master node
that repeatedly spawns new jobs based on a loop body, until
some stopping condition is met. HaLoop also adds a cache-aware
scheduler to Hadoop that colocates map tasks with
the reduce task that produces their input.

MPI Launchers (e.g., [14]) address the need for an explicit
loop step, and by doing so, avoid the scheduling overheads
observed in Hadoop. Pregel [10] and Giraph [2] are two
recent runtimes that support a message passing interface 
(MPI) programming model. Both systems expose an API 
for loading and caching input data. The map step is auto- 
matically fed the output of the prior iteration, usually in the 
form of messages. The reduce step is supported by global 
"aggregators." 

Worker-Aggregator [13], defined by Weimer et al., is a
distributed main memory implementation that uses a flat ag- 
gregation hierarchy with a single aggregator task with direct 
network connections. The system outperforms Hadoop by 
an order of magnitude on a stochastic gradient descent (SGD) 
algorithm. This speedup is in line with earlier MPI re- 
sults ^3. The authors point to a rather unorthodox han- 
dling of failures: As the algorithm evaluated (SGD) is in- 
herently stochastic in its data access, machine failures can 
simply be ignored, as long as they occur independently of 
the data stored on those machines. 

Vowpal Wabbit (VW) [1] is a scalable machine learning
system that integrates the machine learning algorithm(s)
into the runtime. The system includes a Hadoop-aware ver- 
sion of the allreduce function found in MPI. The system 
is highly optimized for fast iterations. A cache aware data 
format is used to speed up the map step, and a binary aggre- 
gation tree is a key optimization used to speed up the re- 
duce step. Task re-scheduling is avoided between iterations 
and communication happens via direct network connections. 
The authors observe an order of magnitude speedup when 
comparing with stock Hadoop. 

Spark [15] is a runtime built on a data abstraction called resilient
distributed datasets (RDDs) that reference immutable
data collections. Spark also provides a domain-specific lan- 
guage (DSL) that consists of standard relational algebra 
transformations (select, project, join) and actions that per- 
form global aggregation. Spark supports iterative algorithms 
that explicitly cache RDDs in-memory. Indeed, the Spark 
runtime is optimized for in-memory computation only. Spark 
has published speed-ups of 30x over stock Hadoop.



3.1 Discussion 

Many of the approaches described above claim order of mag- 
nitude speedups over stock Hadoop when performing Iterative
MapReduce. These runtimes share several characteris-
tics in order to accomplish this goal. They avoid reschedul- 
ing of machines between iterations, cache partitioned data 
between iterations, and use more powerful forms of aggre- 
gation between map and reduce steps. However, these im- 
provements have not been cast in a form that can be ex- 
ploited on an arbitrary cluster environment. To do so re- 
quires us to capture all significant aspects of the computa- 
tion including iteration in the programming model; develop 
a formalization of the plan space, including a definition of 
runtime operations and key parameters such as partition 
width and aggregation tree fan-in, in order to reason about 
alternative equivalent execution plans; and to build an op- 
timizer that can evaluate the cost of these alternative plans 
and choose a good plan. We have already introduced the 
Iterative MapReduce programming model, which captures 
iteration; next, we will build on this to formalize the space 
of equivalent runtime plans for a given program. After that, 
we describe the optimizer in Section 5.

4. PHYSICAL PLAN 

This section presents the physical plans that execute our Iterative
MapReduce programming model on a cluster of machines.
For concreteness, we consider the Iterative MapReduce
dataflow shown in Figure 1 and discuss a plan template
for it: the space of equivalent plans is realized by instantiating
this template with different plan parameter values. Our
implementation uses the Hyracks runtime [5], and a plan
consists of dataflow processing elements, or Hyracks operators,
that execute in the Hyracks runtime. Hyracks splits
each Hyracks operator into multiple tasks that execute in
parallel on a distributed set of machines. Similar to Hadoop,
each task operates on a single partition of the input data. In
Section 4.1, we describe the structure of the physical plan
template and discuss its tunable parameters. Section 4.2
then explores the space of choices that can be made when
executing this physical plan on an arbitrary cluster with
given resources and input data.

4.1 Iterative MapReduce Physical Plan 

Figure 2 depicts the physical plan template for the Iterative
MapReduce dataflow in Figure 1 as two dataflows. The top
dataflow loads the input data from HDFS, parses it into an
internal representation (e.g., binary formatted features), and
partitions it over N cluster machines. The bottom dataflow
executes the computation associated with a Loop operator.
The Driver of the loop (observe that this is now controlled
by the system, which is now aware of the entire program
including the iteration!) is responsible for seeding the initial
global model and driving each iteration based on the loop
condition. The map step is parallelized across some number
of nodes in the cluster determined by the optimizer (this is
reflected in the partitioning strategy chosen for the data
loading step). Each
map task sends a data statistic (e.g., a loss and gradient in a
BGD computation) to a random reduce task participating in
the leaf-level of an aggregation tree. This aggregation tree
is balanced with a parameterized fan-in f (e.g., f = 2 yields
a binary tree) determined by the optimizer. The final aggregate
statistic is passed to the Sequential operator, which
updates the global model stored in HDFS. The Driver detects
this update, and applies the loop condition to the new
model to determine if another iteration should be performed.

Figure 2: Hyracks physical plan for Iterative MapReduce. [Diagram: a Data Loading dataflow (HDFS to Cached Records) and an Iterative Computation dataflow controlled by the Driver (loop).]

This description of the plan template highlights two choices 
to be determined by the optimizer — the number of nodes al- 
located for the map phase of the computation, and the fan-in 
of the aggregation tree for the reduce phase. The structure 
of the plan template comes from consideration of the structure
of the dataflow in Figure 1, and the justification for the
focus on these two optimizer choices will be presented next.

4.2 The Plan Space 

There are several considerations that must be taken into 
account when mapping the physical plan in Figure 2 to an
actual cluster of machines. Many of these considerations are 
well-established techniques for executing data-parallel oper- 
ators on a cluster of machines, and are largely independent 
of the resources available and the program/dataset to be op- 
timized. We begin by discussing these "universal" optimiza- 
tions for arriving at an execution plan. Next, we examine 
those choices that are dependent on the cluster configuration
(i.e., amount of resources) and computation parameters
(i.e., input data and aggregate value sizes). These are the 
choices an optimizer must make for a given program and 
input dataset in the context of a given cluster and current 
workload. 

4.2.1 Universal Optimizations 

Data-local scheduling is generally considered an optimal
choice for executing a dataflow of operators in a cluster environment:
a map task is therefore scheduled on the machine
that hosts its input data. Loop-aware scheduling ensures
that the task state is preserved across iterations. Note
that this is not the same as blocking machines, as is done
in VW [1]. Rather, we want to avoid costly re-optimization
per iteration, taking advantage of the similarity between iterations.
Caching of immutable data can offer significant
speed-ups between iterations. However, careful consideration
is required when the available resources do not allow for
such caching. For example, it is assumed in [15] that sufficient
main memory is always available to cache the data to
be saved across iterations, and performance degrades rapidly
when this assumption does not hold. Efficient data serialization
can offer significant performance improvements.
We use a binary formatted file, which has substantial benefits
in terms of space and time over simple Java objects, to
store our cached records.

4.2.2 Per-Program Optimizer Decisions 
The optimizations discussed in Section 4.2.1 apply equally
to all jobs and can be considered best practices inspired by 
the best-performing systems in the literature. This leaves us 
with two optimization decisions that are dependent on the 
cluster and computation parameters; we discuss them below. 
In the next section, we develop a theoretical foundation for 
an optimizer that can make these choices effectively. 

Data partitioning determines the number of map tasks in 
an Iterative MapReduce physical plan. For a given job and
a maximum number Nmax of machines available to it, the
optimizer needs to decide which number N <= Nmax of machines
to request for the job. The decision is not trivial,
even ignoring the multi-job nature of today's clusters: More 
machines reduce the time in the map phase but increase the 
cost of the reduce phase, since more objects need to be ag- 
gregated. The goal of data partitioning is to find the right 
trade-off between map and reduce costs. 

Aggregation tree structure involves finding the optimal 
fan-in of a single reduce node in a balanced aggregation tree. 
Aggregation trees are commonly used to parallelize the re- 
duce function. For example, Hadoop uses a combiner inter- 
face to perform a single level aggregation tree, and Vowpal 
Wabbit uses a binary aggregation tree. In the next section,
we develop an optimizer to decide an optimal tree structure
for a given job based on the fan-in f of the aggregation nodes
in the tree.
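To make the role of the fan-in concrete, here is a minimal single-process Python sketch of a balanced aggregation tree. The function name `tree_aggregate` and its signature are assumptions made for this illustration; in a real plan every node of a level would run as a separate task in parallel.

```python
from functools import reduce


def tree_aggregate(values, combine, fan_in):
    """Aggregate `values` level by level; each node combines up to `fan_in` inputs.

    `combine` must be associative and commutative. Returns (result, levels).
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(values)
    if not level:
        raise ValueError("nothing to aggregate")
    levels = 0
    while len(level) > 1:
        # One tree level: each group of up to `fan_in` inputs feeds one aggregation node.
        level = [reduce(combine, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels


statistics = list(range(1, 17))  # 16 hypothetical per-map-task statistics
binary = tree_aggregate(statistics, lambda a, b: a + b, fan_in=2)
wide = tree_aggregate(statistics, lambda a, b: a + b, fan_in=4)
flat = tree_aggregate(statistics, lambda a, b: a + b, fan_in=16)
```

All three settings produce the same aggregate; they differ in the number of levels and in how many inputs each node must receive, which is the trade-off the optimizer reasons about.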

5. RUNTIME OPTIMIZATION 

After factoring out optimizations that are universal in na- 
ture, the optimizer needs to answer two crucial questions for 
a given job in a given shared cluster environment: (a) How 
many machines should we devote to the task? (b) What fan-in
f should we use for the aggregation tree phase? In answer-
ing these questions an optimizer can consider two different 
objectives: (a) Minimize the response time (wall-clock time) 
for the program, (b) Minimize the cost of the job. Here, 
we consider machine time as a proxy for cost. While many 
other metrics are conceivable in principle, public clouds such 
as Amazon EC2 have opted for machine time, which makes
it the prime candidate for minimization. 

Below, we present our theoretical findings for these questions.
First, we show that the optimal fan-in
of the aggregation tree is independent of both the cluster
and the job. We use this result to design the optimal partitioning
for two cases: (a) The per-record processing time
is independent of the number of machines used; this is the
case for systems where either all records are read from disk
(e.g., Hadoop) or all records are held in distributed main
memory (e.g., Spark). (b) Caching influences the time to
access/process a record, which is at the heart of Iterative
MapReduce optimization.

Symbol   Meaning
R        Total # records
Nmax     Max # CPUs
M        # records cached per CPU
P        Map time per record
D        Load time per record
A        Aggregation time per object

Table 1: Symbols used in the derivations

As before, we consider the following simple program ex- 
pressible in our programming model: A Loop containing a 
single MapReduce operator followed by a Sequential operator. 
The time spent in the Sequential operator and the iteration 
control are small relative to the time spent on MapReduce, 
hence the optimizer needs only to consider the time spent 
inside the MapReduce operator.

We assume that both our network and computation behave 
linearly: If we invoke a UDF twice as often, we assume that 
it will take twice as long. We assume that data transmis- 
sion to/from a machine behaves linearly. When a machine 
sends or receives data it does so sequentially. Both of these 
assumptions can be violated in real world clusters under ex- 
treme load. However, they represent the behavior within 
the optimal load region of the cluster. 

These assumptions allow us to use the notation found in 
Table 1 to express our model for both the iteration time
and cost. M, P and D can be measured for a given cluster 
and job and R is known for a job. 

Lastly, we assume both the cost and the computational time 
of the MapReduce operator to be comprised additively of the 
cost (time) of the map phase and the cost (time) of the 
reduce phase. Hence, we state: 

$$T(N, f) = T_A(N, f) + T_M(N)$$
$$C(N, f) = C_A(N, f) + C_M(N)$$

As already stated in the equations, we assume the aggregation
time $T_A$ and cost $C_A$ to depend on both the fan-in $f$ and
the number $N$ of machines used. The time $T_M$ and cost $C_M$
to map, on the other hand, solely depend on the number of
machines used. Intuitively, more machines introduce greater
parallelism but at the same time incur additional aggregation
time and cost.

In the remainder of this section, we present theoretically optimal
choices for the fan-in f and the number of machines N
to be used, starting with the fan-in.

5.1 Optimal Aggregation Tree Fan-In 

Theorem 1. The fan-in of the fastest aggregation tree is $f = e$.

Proof. The time it takes to aggregate $N$ inputs in an
aggregation tree of fan-in $f$ can be phrased as:

$$T_A(N, f) = A f \, h(N, f)$$

where $h(N, f)$ is the number of levels in the tree. Aggregation
happens in parallel at each level. Hence, the time
per level is the time spent in a single aggregation node,
$A f$. The height of a tree with $N$ leaf nodes and arity $f$
is $h(N, f) = \log_f N = \frac{\ln N}{\ln f}$. Hence, we arrive at:

$$f = \arg\min_f \left( \frac{A f \ln(N)}{\ln(f)} \right) = e$$

since the derivative of $f / \ln f$ with respect to $f$, namely
$(\ln f - 1) / \ln^2 f$, vanishes only at $\ln f = 1$. □



Corollary 1. The minimal time to process $N$ inputs in a
balanced aggregation tree is:

$$T_A(N) = A e \ln(N)$$



Intuition: The independence from the number of inputs is
easy to see: the difference between the optimal aggregation
tree for a small vs. large number of leaf nodes is sheer scal- 
ing, a process for which the arity of the tree does not change. 
The independence of the transfer and aggregation time A is 
similarly intuitive, as the time spent per aggregation tree 
level and the number of levels balance each other out. 
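The result can be checked numerically with a short Python sketch of the continuous model $T_A(N, f) = A f \ln(N) / \ln(f)$. The function name `aggregation_time` and the sample value of 1024 inputs are assumptions for this illustration, and the model ignores that a real tree has a whole number of levels.

```python
import math


def aggregation_time(n_inputs, fan_in, agg_time_per_object=1.0):
    """Continuous model: time per level (A * f) times the number of levels (log_f N)."""
    return agg_time_per_object * fan_in * math.log(n_inputs) / math.log(fan_in)


# Best integer fan-in over a range of candidates for 1024 inputs.
best_integer = min(range(2, 65), key=lambda f: aggregation_time(1024, f))

# The real-valued optimum f = e is never worse than any integer choice.
time_at_e = aggregation_time(1024, math.e)
time_at_best_integer = aggregation_time(1024, best_integer)
```

Under this model a fan-in of 3 is the best integer choice, while fan-ins 2 and 4 tie because 4 / ln 4 equals 2 / ln 2. The same ranking holds for any number of inputs and any value of A, since both only scale the objective.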

Now we consider cost-optimal aggregation trees. First, we 
discuss the static case where the MapReduce operator is not 
part of a Loop. 



Theorem 2. The cost-optimal fan-in for the reduce phase 
of a MapReduce operator is N . 



5.2 Optimal Partitioning 

We use this model to study the optimal choice for N . In 
Iterative MapReduce, this choice is complicated by caching 
effects when compared to MapReduce: Our physical plan 
makes sure that as much of the training data stays available 
in main memory of the machines as possible, which speeds 
up all but the first iteration. However, it is neither guaran- 
teed that all data can fit into the aggregate main memory 
of a cluster, nor that that solution is optimal in terms of 
response time or cost. Thus, an optimizer must consider 
these two distinct possibilities: (a) the optimal N is the one 
where all data fits into the collective main memory, that is 
R < MN. (b) Some of the data is spilled to disk, R > MN. 

5.2.1 Response Time Minimization 

Theorem 4. Let R < MN. The time- optimal number 
of machines for the map phase of a MapReduce operator is: 



N = 



RP 

Ae 



Proof. The map phase is perfectly parallel. Hence, the 
total processing time is given by: 

T(n) = ^P + Ae\n{N) 
This is minimized for 

N = argmin —P -h Ae\n{N) = argmin -— -h In(iV) 
N \N J N \N 

where W = ^ . This is minimized when its first derivative 
= 0, which the case ioi N = W = □ 



Proof. Decreasing the fan-in below N introduces addi- 
tional aggregation work and doing so does not decrease the 
computational cost of the reduce operation. □ 



Consider the case where the MapReduce operator is part of a 
Loop: All machines used need to wait while the aggregation 
is running, as it is a blocking operation. 

Theorem 3. The cost-optimal fan-in for the reduce phase 
of a MapReduce operator inside of a Loop is e. 

Proof. While the aggregation tree is running, the N map 
machines are idle. The number of inner nodes in the tree 
is y^r^ which means that the cost of the idling machines 
always trumps the cost of the aggregation machines. Hence, 
the fastest aggregation tree is also cost-optimal. □ 



The above establishes that neither the time nor the cost of 
an iteration depend on the fan-in /, as we can replace it 
with its respective optimal choice of e or N. Hence, we can 
refine our cost and time model to be solely dependent on the 
number of machines used N: 

T(N) = Ta{N) + Tm(N) 
C{N) = Ca{N) + Cm{N) 



Theorem 5. For $R > MN$, the time-optimal number of machines to be used for a MapReduce operator is:

$$N_2 = \frac{RD + RP}{Ae}$$

Proof. Processing all $R$ input records takes $RP$ time. $R - MN$ records need to be fetched from disk, which incurs an additional delay of $(R - MN)D$. Both are spread over the $N$ machines working in parallel. The total time for one iteration is thus given by:

$$T_2(N) = eA\ln(N) + \frac{RP + RD}{N} - MD$$

The constant $MD$ does not affect the minimizer $N_2$, which is obtained in the same way as in the analysis above for the case with no spilling: $N_2 = \frac{RD + RP}{Ae}$. □

Our optimizer evaluates both $T_1(N_1)$ and $T_2(N_2)$ and chooses
the lower one for the runtime plan.
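The comparison described above can be sketched as follows. This is a minimal illustration, not the Hyracks implementation: the function names and input values are hypothetical, and the sketch does not check the side conditions $R < MN_1$ and $R > MN_2$ under which the two closed forms were derived.

```python
import math

def in_memory_plan(R, P, A):
    """Closed-form N1 and T1(N1) for the in-memory case (Theorem 4)."""
    n1 = R * P / (A * math.e)
    t1 = R * P / n1 + A * math.e * math.log(n1)
    return n1, t1

def spilling_plan(R, P, D, A, M):
    """Closed-form N2 and T2(N2) for the spilling case (Theorem 5)."""
    n2 = R * (D + P) / (A * math.e)
    t2 = A * math.e * math.log(n2) + R * (P + D) / n2 - M * D
    return n2, t2

def choose_time_plan(R, P, D, A, M):
    """Return (label, N, time) for the candidate with the lower predicted time."""
    n1, t1 = in_memory_plan(R, P, A)
    n2, t2 = spilling_plan(R, P, D, A, M)
    # Side conditions on R, M and N are deliberately not enforced here.
    return ("in-memory", n1, t1) if t1 <= t2 else ("spilling", n2, t2)

# Hypothetical inputs, not measurements from the paper.
label, n, t = choose_time_plan(R=1_000_000, P=1e-3, D=2e-4, A=2.0, M=2_000)
print(label, round(n, 1), round(t, 2))
```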

The number of available machines in a cloud is essentially 
unbounded. At the very least, we can assume that the num- 
ber of machines available exceeds the number of machines 
needed to cache all records of a given job. Hence, the legitimate
question arises whether such an in-memory solution
can ever be slower than a solution using secondary memory.
Below, we study this question. 



Theorem 6. Incurring disk I/O is time-efficient if

$$\frac{D}{P} \in \left(0,\; e^{1 - \frac{MP}{Ae}} - 1\right)$$

Proof. The spilling configuration is better than the in-memory configuration when the best time for the spilling case is better than the best time for the in-memory case, i.e.

$$T_2(N_2) < T_1(N_1)$$

$$Ae\ln\frac{R(D+P)}{Ae} - MD < Ae\ln\frac{RP}{Ae}$$

$$Ae\ln\frac{D+P}{P} < MD$$

Also, for spilling to be necessary we know $R > MN_2$:

$$R > \frac{MR(D+P)}{Ae}$$

$$MD < Ae - MP$$

$$Ae\ln\frac{D+P}{P} < Ae - MP$$

Hence, we arrive at:

$$\ln\frac{D+P}{P} < 1 - \frac{MP}{Ae}$$

The above inequality has solutions only when $\frac{MP}{Ae} \in (0, 1)$. Intuitively, this means that processing all in-memory records on one machine must be cheaper than the time spent by an aggregator in receiving all its input aggregate objects. Hence, Equation 5.2.1 indicates that when

$$\frac{D}{P} \in \left(0,\; e^{1 - \frac{MP}{Ae}} - 1\right)$$

allowing some I/O is better than using more machines to facilitate a completely in-memory map task. □
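The first step of the proof, from $T_2(N_2) < T_1(N_1)$ to an inequality in $D$, $P$, $M$ and $A$ alone, follows from substituting the minimizers of Theorems 4 and 5 into the respective time models. A sketch of that substitution, using only the symbols defined above:

```latex
\begin{align*}
T_1(N_1) &= \frac{RP}{N_1} + Ae\ln(N_1)
          = Ae + Ae\ln\frac{RP}{Ae}, \\
T_2(N_2) &= Ae\ln(N_2) + \frac{R(P+D)}{N_2} - MD
          = Ae + Ae\ln\frac{R(D+P)}{Ae} - MD, \\
T_2(N_2) - T_1(N_1) &= Ae\ln\frac{D+P}{P} - MD .
\end{align*}
```

The common term $Ae$ cancels, so the comparison depends on $R$ only through the side condition $R > MN_2$.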

5.2.2 Cost Minimization

As before, we define cost as the time the iteration takes times 
the number of machines used. Again, we need to consider 
the two cases for whether or not all data can be held in 
distributed main memory separately. 



Theorem 7. With $R < MN$, the cost-minimizing number of machines to use in a MapReduce operator is:

$$N_1 = \frac{R}{M}$$

Proof. Following the discussion above, the iteration cost is given by:

$$C_1(N) = eAN\ln(N) + RP$$

where $eAN\ln(N)$ is the cost of the optimal aggregation tree in the Iterative MapReduce setting. This expression grows with $N$, so the unconstrained minimum lies at the smallest possible $N$. However, we know that $R < MN$. Hence $N_1 = \frac{R}{M}$ is the minimizer within the domain of $N$. □



Theorem 8. For $R > MN$, the cost-minimizing number of machines to use in a MapReduce operator is

$$N_2 = e^{\frac{MD}{Ae} - 1}$$

Proof. The cost is given by the cost of the fastest aggregation tree plus the cost of the map phase:

$$C_2(N) = eAN\ln(N) - NMD + R(P + D)$$

This cost is minimized for:

$$\operatorname*{argmin}_N C_2(N) = \operatorname*{argmin}_N \left(eAN\ln(N) - NMD\right)$$

The first derivative of this expression, $eA(\ln(N) + 1) - MD$, is zero for $N_2 = e^{\frac{MD}{Ae} - 1}$. The second derivative is positive, so we indeed have an optimum. □



Our optimizer evaluates both $C_1(N_1)$ and $C_2(N_2)$ and chooses
the lower one for the runtime plan.
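The two cost models and their minimizers can be written down directly. The Python sketch below is an illustration under the cost definitions given in the proofs of Theorems 7 and 8; the function names and the input values are hypothetical, and the spilling minimizer is only meaningful here because the chosen values give $MD > Ae$.

```python
import math

def cost_in_memory(n, R, P, A):
    """C1(N) = e*A*N*ln(N) + R*P."""
    return math.e * A * n * math.log(n) + R * P

def cost_spilling(n, R, P, D, A, M):
    """C2(N) = e*A*N*ln(N) - N*M*D + R*(P + D)."""
    return math.e * A * n * math.log(n) - n * M * D + R * (P + D)

def cost_optimal_in_memory(R, M):
    """N1 = R / M: the fewest machines that hold all records in memory."""
    return R / M

def cost_optimal_spilling(M, D, A):
    """N2 = exp(M*D / (A*e) - 1), the stationary point of C2."""
    return math.exp(M * D / (A * math.e) - 1)

# Hypothetical inputs (assumptions for illustration only).
R, P, D, A, M = 1_000_000, 1e-3, 1e-3, 2.0, 20_000

n1 = cost_optimal_in_memory(R, M)
n2 = cost_optimal_spilling(M, D, A)
# Brute-force check of the spilling minimizer over integer machine counts.
n2_int = min(range(1, 101), key=lambda n: cost_spilling(n, R, P, D, A, M))
print(n1, n2, n2_int)
```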

6. EXPERIMENTAL EVALUATION 

In this section, we present experiments that evaluate the
optimizer described in Section 5. We compare our approach
to Vowpal Wabbit (VW) [1], a state-of-the-art machine
learning system. Our goal here is to verify the theoretical
foundation of our optimizer as it is encoded in Hyracks.
We show that the time-optimal fan-in is indeed a constant,
independent of the aggregation time $A$ and the number of
CPUs $N$. We present empirical evidence showing that our
static optimizer accurately predicts the optimal strategy.

6.1 Task 

Before presenting the results, we first introduce the chosen 
task: computing gradients for the training of a large scale 
linear model. The goal of training a linear model can be 
formalized as: 



$$\hat{w} = \operatorname*{argmin}_{w} \sum_{(x,y) \in D} l(\langle x, w \rangle, y) \qquad (1)$$



where $D$ is the set of tuples of data point $x$ and label $y$.
The loss function $l$ measures the empirical loss (divergence)
between the prediction $\langle w, x \rangle$ using the model $w$ and the
true label $y$. In many cases, it is convex and differentiable
in the prediction, and therefore in the model $w$. Hence, the
objective function (1) is amenable to convex optimization.
More precisely, the objective function can be minimized us- 
ing gradient descent methods. Such methods, at their core, 
perform iterative steps of the following form: 



$$w_{t+1} = w_t - \eta \sum_{(x,y) \in D} \partial_w l(\langle x, w_t \rangle, y) \qquad (2)$$



Here, $\partial_w$ denotes the gradient with respect to the model $w$
and $\eta$ the step size. The dominant cost in this is computing
the gradients, which decomposes per tuple $(x, y)$. Hence,
this task is amenable to MapReduce and the overall procedure
to Iterative MapReduce.
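The decomposition can be sketched in a few lines of plain Python: a map function computes one gradient per tuple, a reduce function sums them, and a loop applies the update step. This is a single-process illustration only. The squared loss, the dictionary-based sparse vectors, the function names and the toy data are all assumptions made for the example, not the setup used in the experiments.

```python
from functools import reduce

def map_gradient(record, w):
    """Map: gradient of the squared loss 0.5 * (<x, w> - y)^2 for one tuple."""
    x, y = record
    prediction = sum(value * w.get(key, 0.0) for key, value in x.items())
    residual = prediction - y
    return {key: residual * value for key, value in x.items()}

def reduce_sum(g1, g2):
    """Reduce: element-wise sum of two sparse gradients."""
    out = dict(g1)
    for key, value in g2.items():
        out[key] = out.get(key, 0.0) + value
    return out

def iterate(data, w, step):
    """One loop iteration: map over all tuples, reduce, then update w."""
    total = reduce(reduce_sum, (map_gradient(r, w) for r in data), {})
    keys = set(w) | set(total)
    return {k: w.get(k, 0.0) - step * total.get(k, 0.0) for k in keys}

# Toy data (hypothetical): sparse feature dicts with real-valued labels.
data = [({"a": 1.0}, 2.0), ({"b": 1.0}, -1.0), ({"a": 1.0, "b": 1.0}, 1.0)]
w = {}
for _ in range(200):
    w = iterate(data, w, step=0.1)
print(w)
```

Because `reduce_sum` is associative and commutative, the per-tuple gradients can be combined in any order, which is what allows the reduce phase to run as an aggregation tree.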

Data Set: All experiments reported here were performed
on a real-world dataset drawn from the advertisement domain.
The data consists of 2,319,592,301 records whose feature
vectors $x$ are sparse, containing a total of 37,113,474,662
non-zero features. A textual representation of the data set
in the format used by VW (see below) is 492 GB in size.



Hyracks is available as Open Source Software: https://code.google.com/p/hyracks/



Symbol   Meaning                       Value
R        Total # records               2,319,592,301
N_max    Max # map tasks               120
M        # records cached per task     19,329,936
P        Map time per record           3.895 × 10^[illegible] s
D        Load time per record          [illegible] × 10^[illegible] s
A        Aggregation time per object   2.1 s

Table 2: Characteristics of the evaluated environment

Cluster: All experiments were conducted on a single rack
of 30 machines in a Yahoo! Research Cluster. Each machine
has 2 quad-core Intel Xeon E5420 processors, 16GB RAM,
a 1Gbps network interface card, and four 750GB drives
configured as a JBOD, and runs RHEL 5.6. Thus, each machine
can support 4 map tasks, leaving us with $N_{max} = 120$. The
machines are connected to a top-of-rack Cisco 4948E switch.
The connectivity between any pair of nodes in the cluster is
1Gbps. Table 2 shows the statistics of the dataset and task
which we measured and use as input for our optimizer.

6.2 Grounding Experiment 

We begin with an experiment that compares our optimized
plan, executed in the Hyracks runtime system, to Vowpal
Wabbit (VW) [1]. VW uses Hadoop to schedule a map-only
job. Each of these map tasks then downloads the textual
data assigned to it from HDFS to the local disk in an
optimized binary format. The CPUs span a binary aggregation
tree for the reduce operation. Each CPU emits one result
into the aggregation tree, effectively pre-aggregating the
per-CPU results. VW is the first system to reach terascale:
it can operate on datasets with "trillions of features,
billions of training examples and millions of parameters" [1].

On the complete dataset using gradients of 128MB (2^^ di- 
mensions), the average iteration time of VW is 124.41s when 
run on all 120 CPUs of the cluster. The average
iteration time for Hyracks in the same configuration (using
a binary aggregation tree) is 127.42s. We performed an
additional experiment using a fan-in of 4 and per-machine
pre-aggregation, which resulted in an average iteration time of
114.54s. Hence, our optimized plan beats the current state
of the art for this task.

Our optimizer suggests the use of more CPUs than are available
to us (1500) for the given dataset size. Interestingly, and
not by clever experimental design, it also predicts $N = 120$
to be the cost-minimizing configuration, for which a cost
of 13,700 CPU seconds is predicted. We in fact measure
15,000, which is remarkably close given that our optimizer
assumes the optimal fan-in of $e$ and not 2 as used here for
comparison with VW.

6.3 Constant Fan-In 

Our theoretical analysis suggests that the optimal fan-in of
an aggregation tree is independent of both the number of leaf
nodes $N$ and the transfer and processing time per object $A$.
To evaluate this claim, we constructed trees with varying
fan-in over different numbers of leaf nodes aggregating different
vector sizes. In Table 3 we report the minimum-time
fan-in found for each combination.

Size \ N     2    4    8   16   32
1MB          8    5    4    5    4
2MB          5    3    5    5    5
4MB          5    5    4    4    4
8MB          5    4    5    5    3
16MB         5    4    5    5    5
32MB         5    5    5    5    3
64MB         4    4    5    5    5
128MB        8    3    5    5    5

Table 3: Optimal fan-in for combinations of vector
size and number of leaf nodes.

Figure 3: Iteration time and cost using different
numbers of CPUs (x-axis: map slots, from 20 to 120).

The results in Table 3 show that the minimum-time fan-in is
constant at either 4 or 5 in the vast majority of cases. Thus,
we have empirically verified the theoretical prediction that
the optimal fan-in is a constant. However,
the empirically found optimum differs from the theoretical
prediction $e$. We attribute this deviation to effects not modeled
in our theory. To be precise, the addition of an aggregation
node adds a one-time (setup) cost to the system, which
a higher fan-in amortizes.
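One way to see how a setup cost could shift the optimum is to add a fixed per-level overhead to the tree-time model. The sketch below is purely illustrative: the model $(A f + S)\log_f(N)$, the function name and the setup values $S$ are assumptions made here and were not measured or fitted to the results in Table 3.

```python
import math

def tree_time_with_setup(fan_in, leaves, agg_time, setup_time):
    """Assumed model: each of log_f(N) levels costs f * A plus a fixed setup S."""
    levels = math.log(leaves) / math.log(fan_in)
    return levels * (fan_in * agg_time + setup_time)

A, N = 2.1, 32  # A as in Table 2; N is the largest leaf count in Table 3

# Best integer fan-in for a few hypothetical setup costs (in seconds).
best_by_setup = {
    setup: min(range(2, 9), key=lambda f: tree_time_with_setup(f, N, A, setup))
    for setup in (0.0, 3.0, 6.0)
}
print(best_by_setup)
```

With no setup cost the best integer fan-in under this model is 3, the integer closest to $e$; as the assumed setup cost grows, the best fan-in moves to 4 and then 5.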

6.4 Optimal Partitioning 

We now evaluate the other theoretical result presented earlier:
a prescription for the optimal number of machines to
use for a given job. To create this scenario, we use only
1/5 of our total dataset, containing 463,925,403 records.
This amount of data (roughly 100GB in text form) can fit
in the main memory of a subset of our 120 CPUs. For the
characteristics of our cluster as reported in Table 2, our
optimizer picks $N = N_{max} = 120$ to minimize response time
and $N = 24$ to minimize cost.

Figure 3 shows the average iteration times and costs over
this dataset for different numbers of CPUs. All experiments
use a fan-in of 4, as determined by the prior experiment.
The results show that the response time is indeed minimized
for $N = 120$, as predicted by our optimizer. Furthermore,
$N = 24$ is the cost-minimizing configuration for this job,
again as predicted.
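The cost-minimizing predictions can be reproduced from the values of $R$ and $M$ in Table 2 using the in-memory result $N_1 = R/M$ of Theorem 7. The short Python check below uses only those published numbers; the variable and function names are hypothetical.

```python
# Values taken from Table 2 and from the text of this section.
M = 19_329_936            # records cached per task
R_FULL = 2_319_592_301    # records in the complete dataset
R_SUBSET = 463_925_403    # records in the 1/5 subset

def cost_optimal_machines(records, cached_per_task):
    """N1 = R / M from Theorem 7 (in-memory case)."""
    return records / cached_per_task

n_full = cost_optimal_machines(R_FULL, M)
n_subset = cost_optimal_machines(R_SUBSET, M)
print(round(n_full, 3), round(n_subset, 3))
```

Both ratios round to the machine counts quoted in the text: 120 for the complete dataset and 24 for the subset.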



6.5 Discussion 

Our runtime and optimizer are competitive with the current
state of the art in large-scale machine learning systems. This
is especially noteworthy as our system makes fewer assumptions than
competing systems: it neither assumes enough resources to
cache all data (like Spark), nor does it default to reading all
data from disk (like Hadoop). Additionally, all experimental
findings were consistent with the theoretical findings
presented above. In summary, our static optimizer was able to
pick a good plan in all combinations we tested.

7. CONCLUSIONS 

MapReduce does not support iteration, which is important 
for machine learning tasks that are being increasingly carried 
out on Big Data in large-scale "cloud" cluster environments. 
In this paper, we argued that the right way to support it- 
eration is to fundamentally extend the MapReduce model 
with a looping construct, thereby allowing the system to 
reason about the entire program execution. We presented 
such an extension, called Iterative MapReduce. To illustrate 
the power of automatic database-style optimization, we con- 
sidered a class of Iterative MapReduce programs that can 
readily express many ML tasks, and developed an optimizer 
that automatically instantiates an efficient execution plan, 
taking into account a broad range of optimizations including 
data-local and loop-aware scheduling, data caching, serialization
costs, intelligent data partitioning and resource allocation,
and auto-configuration of the aggregation tree for
the reduce phase. We presented theoretical justifications
for the two key decisions made by the optimizer on a per- 
program basis, namely data partitioning/resource allocation 
and aggregation-tree configuration, and presented empirical 
results that demonstrate our plans to be competitive with a 
specialized state-of-the-art implementation. 

Much remains to be done. The optimizer must be extended 
to cover the full range of Iterative MapReduce programs, 
and to take into account the likelihood of different kinds 
of failures in a cost-based manner. A more comprehensive 
evaluation must be carried out to establish that optimiz- 
ers can indeed be competitive with specialized state-of-the- 
art implementations for diverse ML problems. Nonetheless, 
our results are extremely encouraging in that they offer the 
promise of efficient system-driven optimization for a broad 
class of ML programs. This is especially significant given 
that programmers cannot effectively tune their programs 
in cloud systems with rapidly changing resource availabil- 
ity (thanks to multi-tenancy, elasticity, and input datasets 
that can change significantly across different runs of the 
same program). We believe that automatic system-driven 
program optimization along the lines pioneered by database 
query optimizers is the only feasible avenue for future cloud 
systems, and the results in this paper are a first step in this 
direction. 